Data Analysis

Comprehensive Data Cleaning & Exploratory Analysis of Job Market Trends

Authors
Affiliation

Mahira Ayub

Boston University

Ava Godsy

Boston University

Joshua Lawrence

Boston University

import pandas as pd

# Load the dataset
data = pd.read_csv('data\lightcast_job_postings.csv')
# Extract Job_Postings table
job_postings = data[['ID', 'TITLE_RAW', 'TITLE_CLEAN', 'POSTED', 'EXPIRED',
                     'SALARY_FROM', 'SALARY_TO', 'MIN_YEARS_EXPERIENCE',
                     'MAX_YEARS_EXPERIENCE', 'SKILLS', 'SPECIALIZED_SKILLS',
                     'SOFTWARE_SKILLS', 'EMPLOYMENT_TYPE']]
job_postings.to_csv('output/job_postings.csv', index=False)

# Extract Company table
company= data[['COMPANY_NAME', 'COMPANY_RAW', 'COMPANY_IS_STAFFING']]
company["COMPANY_ID"] = company.index
company.to_csv('output/company.csv', index=False)
<>:4: SyntaxWarning:

invalid escape sequence '\l'

<>:4: SyntaxWarning:

invalid escape sequence '\l'

C:\Users\jt-la\AppData\Local\Temp\ipykernel_16632\2876480636.py:4: SyntaxWarning:

invalid escape sequence '\l'

C:\Users\jt-la\AppData\Local\Temp\ipykernel_16632\2876480636.py:4: DtypeWarning:

Columns (19,30) have mixed types. Specify dtype option on import or set low_memory=False.

C:\Users\jt-la\AppData\Local\Temp\ipykernel_16632\2876480636.py:14: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

0.1 Dropping Unncessary Columns

Answer the Questions + Which columns are irrelevant or redundant? + Why are we removing multiple versions of NAICS/SOC codes? + How will this improve analysis?

columns_to_drop = [
'ID',
'LAST_UPDATED_TIMESTAMP',
'DUPLICATES',
'NAICS2',
'NAICS2_NAME',
'NAICS3',
'NAICS3_NAME',
'NAICS4',
'NAICS4_NAME',
'NAICS5',
'NAICS5_NAME',
'NAICS6',
'NAICS6_NAME',
'SOC_2021_2',
'SOC_2021_2_NAME',
'SOC_2021_3',
'SOC_2021_3_NAME',
'SOC_2021_4',
'SOC_2021_4_NAME',
'SOC_2',
'SOC_2_NAME',
'SOC_3',
'SOC_3_NAME',
'SOC_4',
'SOC_4_NAME',
'SOC_5',
'SOC_5_NAME',
'NAICS_2022_2',
'NAICS_2022_2_NAME',
'NAICS_2022_3',
'NAICS_2022_3_NAME',
'NAICS_2022_4',
'NAICS_2022_4_NAME',
'NAICS_2022_5',
'NAICS_2022_5_NAME'
]
data.drop(columns=columns_to_drop, inplace=True)

0.2 Handling Missing Values

Answer the question: How should missing values be handled?

import missingno as msno
import matplotlib.pyplot as plt

# Visualize missing data
msno.heatmap(data)
plt.title("Missing Values Heatmap")
plt.show()

# Fill missing values
data["SALARY"].fillna(data["SALARY"].median(), inplace=True)
data["NAICS_2022_6_NAME"].fillna("Unknown", inplace=True)

C:\Users\jt-la\AppData\Local\Temp\ipykernel_16632\1556623005.py:10: FutureWarning:

A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.



C:\Users\jt-la\AppData\Local\Temp\ipykernel_16632\1556623005.py:11: FutureWarning:

A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


0.3 Remove Duplicates

data = data.drop_duplicates(subset=["TITLE", "COMPANY", "LOCATION", "POSTED"], keep="first")

1 Exploratory Data Analysis

1.1 Job Postings by Industry

import plotly.express as px
fig = px.bar(data["NAICS_2022_6_NAME"].value_counts(), title="Job Postings by Industry")
fig.show()

Job Posting Anaylsis

1.2 Salary Distribution by Industry

fig = px.box(data, x="NAICS_2022_6_NAME", y="SALARY", title="Salary Distribution by Industry")
fig.show()

Job Posting Analysis

1.3 Remote vs. On-Site Jobs

fig = px.pie(data, names="REMOTE_TYPE_NAME", title="Remote vs. On-Site Jobs")
fig.show()

Remote vs. On-Site Jobs Analysis